Exploratory Data Analysis

In this notebook, we explore and visualize the structure of the dataset. To run the code, both the kernel matrices and the vector representations of the IMDB dataset must be present in "kernels/without_labels".
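The representations could be loaded with a small helper along these lines; note that the file stems and the .npy format below are assumptions for illustration, not the notebook's actual layout:

```python
import os
import numpy as np

KERNEL_DIR = "kernels/without_labels"           # location expected by this notebook
KERNEL_NAMES = ["wl1", "wl2", "wl3", "wl4",     # assumed file stems, one per kernel
                "wl5", "graphlet", "shortestpath"]

def load_vectors(kernel_dir, names):
    """Load one feature matrix per kernel from <name>.npy files (assumed format)."""
    return {n: np.load(os.path.join(kernel_dir, n + ".npy")) for n in names}

# vectors = load_vectors(KERNEL_DIR, KERNEL_NAMES)
```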

Looking at a small sample of Ego-graphs

This small sample shows that, to a human eye, there are no obvious qualitative differences in the shapes of the graphs. Both genres contain actors with sparse, dense, large, and small graphs, which suggests this problem may be difficult to solve.

Comparing Genres by Mean Degree

Both genres have a similar mean-degree distribution, but the action distribution is shifted marginally to the right, indicating that action actors share more films on average. The large overlap suggests it may be hard to separate the genres with purely unsupervised techniques.
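The mean degree of each ego-graph can be computed directly from its adjacency matrix; a minimal sketch:

```python
import numpy as np

def mean_degree(adj):
    """Mean node degree of a graph given its binary adjacency matrix."""
    return adj.sum(axis=1).mean()

# Toy check: a triangle is 2-regular, so its mean degree is 2.
triangle = np.array([[0, 1, 1],
                     [1, 0, 1],
                     [1, 1, 0]])
```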

Number of PCA Components:

The number of PCA components barely changes the NMI with the ground truth labels, and the performance ranking of the kernels stays the same as the number of components varies. The Graphlet and Shortest Path kernels clearly align poorly with the genre labels.

From now on we will use 10 PCA components for the k-means analysis, for speed, without sacrificing much "accuracy".
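The PCA + k-means + NMI pipeline used here can be sketched as follows (function name and defaults are illustrative, not the notebook's actual code):

```python
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def kmeans_nmi(features, labels, n_components=10, n_clusters=2, seed=0):
    """Project kernel features with PCA, cluster with k-means,
    and score the clustering against the ground truth labels."""
    reduced = PCA(n_components=n_components, random_state=seed).fit_transform(features)
    pred = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(reduced)
    return normalized_mutual_info_score(labels, pred)
```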

Comparing Kernels

K-means on the different kernels results in quite different clusterings.

Results

In general, the WL kernels pick a fully connected graph as one cluster centroid and a sparsely connected graph as another. This makes sense: in a fully connected graph every node has the same neighbourhood, so its node labels remain uniform under any number of WL iterations.
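This invariance is easy to demonstrate with a bare-bones WL refinement step (a simplified sketch, not the kernel implementation used in the notebook):

```python
import numpy as np

def wl_iteration(adj, labels):
    """One Weisfeiler-Lehman refinement step: each node's new label is a hash
    of its own label plus the sorted multiset of its neighbours' labels."""
    raw = []
    for i in range(len(labels)):
        neigh = sorted(labels[j] for j in np.flatnonzero(adj[i]))
        raw.append(hash((labels[i], tuple(neigh))))
    # compress the hashes back to small consecutive integers
    uniq = {h: k for k, h in enumerate(dict.fromkeys(raw))}
    return [uniq[h] for h in raw]

# In a complete graph every node sees the same neighbourhood,
# so a uniform labelling stays uniform after refinement.
n = 5
complete = np.ones((n, n), dtype=int) - np.eye(n, dtype=int)
labels = wl_iteration(complete, [0] * n)
```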

The Graphlet kernel picks two cluster centroids where the graphlets look similar.

The Shortest Path kernel has one cluster centroid with short paths, and one with longer paths on average.

NMI between different Kernels K-means clusterings

WL1, WL2 and WL3 produce very similar clusterings, as do WL4 and WL5. Graphlet and Shortest Path are similar neither to each other nor to the WL kernels.

However, the most important finding is that none of the kernel clusterings match the ground truth labels well at all. This indicates that the ground truth labels are not closely related to the graph structure itself, and that additional node or edge labels may be needed.
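The pairwise comparison between clusterings can be computed as an NMI matrix; a minimal sketch (the helper name is an assumption):

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def pairwise_nmi(clusterings):
    """NMI matrix between clusterings, given a dict: name -> label array."""
    names = list(clusterings)
    m = np.zeros((len(names), len(names)))
    for i, a in enumerate(names):
        for j, b in enumerate(names):
            m[i, j] = normalized_mutual_info_score(clusterings[a], clusterings[b])
    return names, m
```

NMI is invariant to relabelling, so two clusterings that split the data the same way score 1.0 even if their cluster ids are swapped.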

DBSCAN

ERiC is a density-based clustering algorithm built on DBSCAN, so it is interesting to see whether vanilla DBSCAN already leads to meaningful results. Since DBSCAN labels points for which no cluster assignment is found as noise, it can also be used for outlier detection.

NMIs and comparison to ground truth labels

We compare the clustering results of different vector representations to each other and to the ground truth labels (action and romance).

NMI comparison for different kernels and parameters

As seen below, none of the DBSCAN clusterings correspond to the ground truth labels, with normalized mutual information scores below 0.15. The clusterings based on the WL2 to WL5 representations are the most similar to the ground truth labels. For higher eps, the similarity to the ground truth tends to be lower, especially for the graphlet, shortestpath and WL1 representations.

A higher min_samples decreases the normalized mutual information score between the clusterings and the ground truth labels. A possible reason for this decline is that a higher min_samples causes more points to be labelled as noise instead of being assigned to a cluster.
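The parameter sweep described above can be sketched as a small grid search (function name and grid values are illustrative assumptions):

```python
from sklearn.cluster import DBSCAN
from sklearn.metrics import normalized_mutual_info_score

def dbscan_nmi_grid(features, labels, eps_values, min_samples_values):
    """NMI against the ground truth labels for a grid of DBSCAN parameters."""
    scores = {}
    for eps in eps_values:
        for ms in min_samples_values:
            pred = DBSCAN(eps=eps, min_samples=ms).fit_predict(features)
            scores[(eps, ms)] = normalized_mutual_info_score(labels, pred)
    return scores
```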

The heatmaps below show the normalized mutual information scores between the different representations for different DBSCAN parameters. As seen before, WL2 to WL5 result in very similar clusterings. The similarity between the shortestpath and graphlet representations seems to depend on the value of eps.

Outlier detection with DBSCAN

The orange points in the visualizations represent outliers, while the blue points represent points assigned to clusters. As suspected, setting a higher min_samples seems to lead to more noise points.
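The fraction of noise points can be measured directly, since DBSCAN marks noise with the label -1 (the helper below is a sketch, not the notebook's code):

```python
import numpy as np
from sklearn.cluster import DBSCAN

def noise_fraction(features, eps, min_samples):
    """Fraction of points DBSCAN labels as noise (cluster label -1)."""
    pred = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(features)
    return float(np.mean(pred == -1))
```

Raising min_samples for a fixed eps can only shrink the set of core points, so the noise fraction can only stay equal or grow.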

Graph visualizations for selected dimensions

Outlier detection with quantiles

The following section finds outliers with a simple statistical method: data points whose values in several dimensions lie below the first or above the third quartile are defined as outliers. The minimum number of such dimensions is chosen arbitrarily to obtain the desired number of outliers.
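This rule can be sketched as follows (the function name and the min_dims threshold are illustrative assumptions):

```python
import numpy as np

def quantile_outliers(features, min_dims):
    """Mark points whose value lies outside the [Q1, Q3] range of the
    per-dimension distribution in at least `min_dims` dimensions."""
    q1 = np.quantile(features, 0.25, axis=0)
    q3 = np.quantile(features, 0.75, axis=0)
    extreme = (features < q1) | (features > q3)
    return extreme.sum(axis=1) >= min_dims
```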

Below, you can see the graphs which were selected as outliers in all vector representations. Similar to the outliers found by DBSCAN, these graphs seem to have more interconnections and a more complex structure than the ego-graphs shown in the section "Comparing kernels". Interestingly, most of the overall outliers have a label of 0 (since their index is below 500), implying that they belong to the "romance" genre.

In summary, the DBSCAN clusterings show some relation to each other but barely any relation to the ground truth labels. Most outliers found with statistical means are part of the "romance" genre, which might be useful when fine-tuning a classification model on the dataset.